Abstract

Many probabilistic models for natural language are now written in
terms of hierarchical tree structure. Tree-based modeling still lacks
many of the standard tools taken for granted in (finite-state)
string-based modeling. The theory of tree transducer automata provides
a possible framework to draw on, as it has been worked out in an
extensive literature. We motivate the use of tree transducers for
natural language and address the training problem for probabilistic
tree-to-tree and tree-to-string transducers.

1  Introduction 

Much of natural language work over the past decade has employed
probabilistic finite-state transducers (FSTs) operating on strings.
This has occurred somewhat under the influence of speech recognition,
where transducing acoustic sequences to word sequences is neatly
captured by left-to-right stateful substitution. Many conceptual tools
exist, such as Viterbi decoding (Viterbi, 1967) and forward-backward
training (Baum and Eagon, 1967), as well as generic software toolkits.
Moreover, a surprising variety of problems are attackable with FSTs,
from part-of-speech tagging to letter-to-sound conversion to name
transliteration.

However, language problems like machine translation break this mold,
because they involve massive re-ordering of symbols, and because the
transformation processes seem sensitive to hierarchical tree
structure. Recently, specific probabilistic tree-based models have
been proposed not only for machine translation (Wu, 1997; Alshawi,
Bangalore, and Douglas, 2000; Yamada and Knight, 2001; Gildea, 2003;
Eisner, 2003), but also for summarization (Knight and Marcu, 2002),
paraphrasing (Pang, Knight, and Marcu, 2003), natural language
generation (Langkilde and Knight, 1998; Bangalore and Rambow, 2000;
Corston-Oliver et al., 2002), and language modeling (Baker, 1979; Lari
and Young, 1990; Collins, 1997; Chelba and Jelinek, 2000; Charniak,
2001; Klein and Manning, 2003). It is useful to understand generic
algorithms that may support all these tasks and more.

(Rounds, 1970) and (Thatcher, 1970) independently introduced tree
transducers as a generalization of FSTs. Rounds was motivated by
natural language. The Rounds tree transducer is very similar to a
left-to-right FST, except that it works top-down, pursuing subtrees in
parallel, with each subtree transformed depending only on its own
passed-down state. This class of transducer is often nowadays called
R, for “Root-to-frontier” (Gecseg and Steinby, 1984).

Rounds uses a mathematics-oriented example of an R transducer, which
we summarize in Figure 1. At each point in the top-down traversal, the
transducer chooses a production to apply, based only on the current
state and the current root symbol. The traversal continues until there
are no more state-annotated nodes. Non-deterministic transducers may
have several productions with the same left-hand side, and therefore
some free choices to make during transduction.

An R transducer compactly represents a potentially infinite set of
input/output tree pairs: exactly those pairs (T1, T2) for which some
sequence of productions applied to T1 (starting in the initial state)
results in T2. This is similar to an FST, which compactly represents a
set of input/output string pairs, and in fact, R is a generalization
of FST. If we think of strings written down vertically, as degenerate
trees, we can convert any FST into an R transducer by automatically
replacing FST transitions with R productions.

R does have some extra power beyond path following and state-based
record keeping. It can copy whole subtrees, and transform those
subtrees differently. It can also delete subtrees without inspecting
them (imagine by analogy an FST that quits and accepts right in the
middle of an input string). Variants of R that disallow copying and
deleting are called RL (for linear) and RN (for nondeleting),
respectively.

Transducer alphabet: {0, 1, a, y, sin, cos, plus, mult}
Transducer states: {d for “derive”, i for “identity”}
Transducer rules:

1. d plus(x0, x1) -> plus(d x0, d x1)

3. d sin(x0) -> mult(cos(i x0), d x0)

[Transducer in action: tree diagrams not reproduced]

Figure 1: A sample R tree transducer that takes the derivative of its
input.

One advantage of working with tree transducers is the large and useful
body of literature about these automata; two excellent surveys are
(Gecseg and Steinby, 1984) and (Comon et al., 1997). For example, R is
not closed under composition (Rounds, 1970), and neither are RL or F
(the “frontier-to-root” cousin of R), but the non-copying FL is closed
under composition. Many of these composition results are first found
in (Engelfriet, 1975).

R has a surprising ability to change the structure of an input tree.
For example, it may not be initially obvious how an R transducer can
transform the English structure S(PRO, VP(V, NP)) into the Arabic
equivalent S(V, PRO, NP), as it is difficult to move the subject PRO
into position between the verb V and the direct object NP. First, R
productions have no lookahead capability: the left-hand side of the S
production consists only of q S(x0, x1), although we want our
English-to-Arabic transformation to apply only when it faces the
entire structure q S(PRO, VP(V, NP)). However, we can simulate
lookahead using states, as in these productions:

- q S(x0, x1) -> S(qpro x0, qvp.v.np x1)

- qpro PRO -> PRO

- qvp.v.np VP(x0, x1) -> VP(qv x0, qnp x1)

By omitting rules like qpro NP -> ..., we ensure that the entire
production sequence will dead-end unless the first child of the input
tree is in fact PRO. So finite lookahead is not a problem. The next
problem is how to get the PRO to appear in between the V and NP, as in
Arabic. This can be carried out using copying. We make two copies of
the English VP, and assign them different states:

- q S(x0, x1) -> S(qleft.vp.v x1, qpro x0, qright.vp.np x1)

- qpro PRO -> PRO

- qleft.vp.v VP(x0, x1) -> qv x0

- qright.vp.np VP(x0, x1) -> qnp x1
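As an illustration only, the copying productions above can be run with a few lines of Python. This is a minimal sketch of a deterministic, unweighted top-down transducer, not the implementation used in this paper. The tuple encoding of trees, the names `RULES`, `transduce`, and `build`, and the two terminal rules for `qv` and `qnp` (which the text does not list) are all assumptions made for the example.

```python
# A minimal deterministic top-down (R) transducer; trees are (label, [children]).
# A right-hand side is either an output node (label, [parts]) or a state call
# ("STATE", q, k), meaning "transform input child k (0-based) in state q".
RULES = {
    ("q", "S", 2): ("S", [("STATE", "qleft.vp.v", 1),
                          ("STATE", "qpro", 0),
                          ("STATE", "qright.vp.np", 1)]),
    ("qpro", "PRO", 0): ("PRO", []),
    ("qleft.vp.v", "VP", 2): ("STATE", "qv", 0),
    ("qright.vp.np", "VP", 2): ("STATE", "qnp", 1),
    # Assumed terminal rules (not listed in the text):
    ("qv", "V", 0): ("V", []),
    ("qnp", "NP", 0): ("NP", []),
}


def transduce(state, tree):
    """Apply the rule for (state, root label, rank); None signals a dead end."""
    label, children = tree
    rhs = RULES.get((state, label, len(children)))
    return None if rhs is None else build(rhs, children)


def build(rhs, children):
    """Instantiate a right-hand side against the matched node's children."""
    if len(rhs) == 3:                      # ("STATE", q, k): recurse on child k
        _, state, k = rhs
        return transduce(state, children[k])
    label, parts = rhs
    built = [build(part, children) for part in parts]
    return None if any(b is None for b in built) else (label, built)


english = ("S", [("PRO", []), ("VP", [("V", []), ("NP", [])])])
arabic = transduce("q", english)   # the VP is visited twice, once per copy
```

Because no rule exists for `qpro` on an NP node, an input such as S(NP, VP(V, NP)) dead-ends and `transduce` returns `None`, mirroring the lookahead-by-states argument.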

While general properties of R are understood, there are many
algorithmic questions. In this paper, we take on the problem of
training probabilistic R transducers. For many language problems
(machine translation, paraphrasing, text compression, etc.), it is
possible to collect training data in the form of tree pairs and to
distill linguistic knowledge automatically.

Our  problem  statement  is:  Given  (1)  a  particular 
transducer  with  productions  P,  and  (2)  a  finite  training  set 
of  sample  input/output  tree  pairs,  we  want  to  produce  (3) 
a  probability  estimate  for  each  production  in  P  such  that 
we  maximize  the  probability  of  the  output  trees  given  the 
input  trees. 


As organized in the rest of this paper, we accomplish this by
intersecting the given transducer with each input/output pair in turn.
Each such intersection produces a set of weighted derivations that are
packed into a regular tree grammar (Sections 3-5), which is equivalent
to a tree substitution grammar. The inside and outside probabilities
of this packed derivation structure are used to compute expected
counts of the productions from the original, given transducer
(Sections 6-7). Section 9 gives a sample transducer implementing a
published machine translation model; some readers may wish to skip to
this section directly.

2  Trees 

$T_\Sigma$ is the set of (rooted, ordered, labeled, finite) trees over
alphabet $\Sigma$. An alphabet is just a finite set.

$T_\Sigma(X)$ are the trees over alphabet $\Sigma$, indexed by $X$:
the subset of $T_{\Sigma \cup X}$ where only leaves may be labeled by
$X$. ($T_\Sigma(\emptyset) = T_\Sigma$.) Leaves are nodes with no
children.

The nodes of a tree $t$ are identified one-to-one with its paths:
$paths_t \subset paths = \mathbb{N}^* = \bigcup_{i=0}^{\infty} \mathbb{N}^i$
(with $\mathbb{N}^0 = \{()\}$). The path to the root is the empty
sequence $()$, and $p_1$ extended by $p_2$ is $p_1 \cdot p_2$, where
$\cdot$ is concatenation.

For $p \in paths_t$, $rank_t(p)$ is the number of children, or rank,
of the node at $p$ in $t$, and $label_t(p) \in \Sigma \cup X$ is its
label. The ranked label of a node is the pair
$labelandrank_t(p) = (label_t(p), rank_t(p))$. For
$1 \le i \le rank_t(p)$, the $i$-th child of the node at $p$ is
located at path $p \cdot (i)$. The subtree at path $p$ of $t$ is
$t \downarrow p$, defined by
$paths_{t \downarrow p} = \{q \mid p \cdot q \in paths_t\}$ and
$labelandrank_{t \downarrow p}(q) = labelandrank_t(p \cdot q)$.

The paths to $X$ in $t$ are
$paths_t(X) = \{p \in paths_t \mid label_t(p) \in X\}$. A frontier is
a set of paths $f$ that are pairwise prefix-independent:

$$\forall p_1, p_2 \in f,\ p \in paths : p_1 = p_2 \cdot p \Rightarrow p_1 = p_2$$

A frontier of $t$ is a frontier $f \subseteq paths_t$.

For $t, s \in T_\Sigma(X)$, $p \in paths_t$, $t[p \leftarrow s]$ is
the substitution of $s$ for $p$ in $t$, where the subtree at path $p$
is replaced by $s$. For a frontier $f$ of $t$, the mass substitution
of $X$ for the frontier $f$ in $t$ is written
$t[p \leftarrow X, \forall p \in f]$ and is equivalent to substituting
the $X(p)$ for the $p$ serially in any order.

Trees may be written as strings over $\Sigma \cup \{(, )\}$ in the
usual way. For example, the tree $t$ = S(NP, VP(V, NP)) has
$labelandrank_t((2))$ = (VP, 2) and $labelandrank_t((2, 1))$ = (V, 0).
For $t \in T_\Sigma$, $\sigma \in \Sigma$, $\sigma(t)$ is the tree
whose root has label $\sigma$ and whose single child is $t$.

The yield of $X$ in $t$ is $yield_t(X)$, the string formed by reading
out the leaves labeled with $X$ in left-to-right order. The usual case
(the yield of $t$) is $yield_t = yield_t(\Sigma)$.
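To make the notation of this section concrete, here is a small Python sketch of paths, ranked labels, substitution, and yield. The encoding of a tree as a `(label, [children])` tuple and all function names are assumptions of the example, not notation from the paper; paths are tuples of 1-based child indices, as in the definitions.

```python
def paths(t, prefix=()):
    """All paths of tree t, root first, each a tuple of 1-based child indices."""
    _, children = t
    result = [prefix]
    for i, child in enumerate(children, start=1):
        result.extend(paths(child, prefix + (i,)))
    return result


def subtree(t, p):
    """The subtree of t at path p."""
    for i in p:
        t = t[1][i - 1]
    return t


def label_and_rank(t, p):
    """The ranked label (label, number of children) of the node at path p."""
    label, children = subtree(t, p)
    return (label, len(children))


def substitute(t, p, s):
    """t[p <- s]: a copy of t in which the subtree at path p is replaced by s."""
    if not p:
        return s
    label, children = t
    new_children = list(children)
    new_children[p[0] - 1] = substitute(children[p[0] - 1], p[1:], s)
    return (label, new_children)


def tree_yield(t, alphabet=None):
    """Leaf labels in left-to-right order, restricted to `alphabet` if given."""
    label, children = t
    if not children:
        return [label] if alphabet is None or label in alphabet else []
    return [x for child in children for x in tree_yield(child, alphabet)]


t = ("S", [("NP", []), ("VP", [("V", []), ("NP", [])])])
```

With this encoding, `label_and_rank(t, (2,))` corresponds to the ranked label (VP, 2) from the example in the text.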


$\Sigma$ = {S, NP, VP, PP, PREP, DET, N, V, run, the, of, sons,
daughters}

$N$ = {qnp, qpp, qdet, qn, qprep}

$S$ = q

$P$ = { q $\rightarrow^{1.0}$ S(qnp, VP(V(run))),
        qnp $\rightarrow^{0.6}$ NP(qdet, qn),
        qnp $\rightarrow^{0.4}$ NP(qnp, qpp),
        qpp $\rightarrow^{1.0}$ PP(qprep, qnp),
        qdet $\rightarrow^{1.0}$ DET(the),
        qprep $\rightarrow^{1.0}$ PREP(of),
        qn $\rightarrow^{0.5}$ N(sons),
        qn $\rightarrow^{0.5}$ N(daughters) }

Figure 2: A sample weighted regular tree grammar (wRTG)

3  Regular  Tree  Grammars 

In this section, we describe the regular tree grammar, a common way of
compactly representing a potentially infinite set of trees (similar to
the role played by the finite-state acceptor FSA for strings). We
describe the version (equivalent to TSG (Schabes, 1990)) where the
generated trees are given weights, as are strings in a WFSA.

A weighted regular tree grammar (wRTG) $G$ is a quadruple
$(\Sigma, N, S, P)$, where $\Sigma$ is the alphabet, $N$ is the finite
set of nonterminals, $S \in N$ is the start (or initial) nonterminal,
and $P \subseteq N \times T_\Sigma(N) \times \mathbb{R}^+$ is the
finite set of weighted productions
($\mathbb{R}^+ = \{r \in \mathbb{R} \mid r > 0\}$). A production
$(lhs, rhs, w)$ is written $lhs \rightarrow^{w} rhs$. Productions
whose rhs contains no nonterminals ($rhs \in T_\Sigma$) are called
terminal productions, and rules of the form $A \rightarrow^{w} B$, for
$A, B \in N$, are called $\epsilon$-productions, or epsilon
productions, and can be used in lieu of multiple initial nonterminals.

Figure 2 shows a sample wRTG. This grammar accepts an infinite number
of trees. The tree S(NP(DET(the), N(sons)), VP(V(run))) comes out with
probability 0.3, the product of the weights 1.0, 0.6, 1.0, 0.5 of the
four productions used to derive it.
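The 0.3 figure can be checked mechanically. The sketch below encodes the grammar of Figure 2 (using the label DET, as in the figure) and sums the weights of all derivations of a given tree. The dictionary layout and the function names `weight` and `match` are assumptions of the example; the recursion terminates because this grammar has no epsilon productions.

```python
# Productions of Figure 2: nonterminal -> [(weight, rhs)].
# An rhs node is (label, [children]); a nonterminal leaf is a plain string.
PRODUCTIONS = {
    "q":     [(1.0, ("S", ["qnp", ("VP", [("V", [("run", [])])])]))],
    "qnp":   [(0.6, ("NP", ["qdet", "qn"])),
              (0.4, ("NP", ["qnp", "qpp"]))],
    "qpp":   [(1.0, ("PP", ["qprep", "qnp"]))],
    "qdet":  [(1.0, ("DET", [("the", [])]))],
    "qprep": [(1.0, ("PREP", [("of", [])]))],
    "qn":    [(0.5, ("N", [("sons", [])])),
              (0.5, ("N", [("daughters", [])]))],
}


def weight(nonterminal, tree):
    """Sum of weights of all derivations of `tree` from `nonterminal`."""
    return sum(w * match(rhs, tree) for w, rhs in PRODUCTIONS[nonterminal])


def match(rhs, tree):
    """Weight contributed by deriving `tree` from the (partial) tree `rhs`."""
    if isinstance(rhs, str):               # nonterminal leaf: expand it
        return weight(rhs, tree)
    label, kids = rhs
    t_label, t_kids = tree
    if label != t_label or len(kids) != len(t_kids):
        return 0.0
    result = 1.0
    for kid, t_kid in zip(kids, t_kids):
        result *= match(kid, t_kid)
        if result == 0.0:                  # no derivation possible; stop early
            break
    return result


tree = ("S", [("NP", [("DET", [("the", [])]), ("N", [("sons", [])])]),
              ("VP", [("V", [("run", [])])])])
tree_weight = weight("q", tree)
```

Only the qnp production with weight 0.6 matches the subject NP here, so the total reduces to a single derivation.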

We define the binary relation $\Rightarrow_G$ (single-step derives in
$G$) on $T_\Sigma(N) \times (paths \times P)^*$, pairs of trees and
derivation histories, which are logs of (location, production used):

$$\Rightarrow_G = \{((a, h), (b, h \cdot (p, (l, r, w)))) \mid (l, r, w) \in P \wedge p \in paths_a(\{l\}) \wedge b = a[p \leftarrow r]\}$$

where $(a, h) \Rightarrow_G (b, h \cdot (p, (l, r, w)))$ iff tree $b$
may be derived from tree $a$ by using the rule $l \rightarrow^{w} r$
to replace the nonterminal leaf $l$ at path $p$ with $r$. For a
derivation history
$h = ((p_1, (l_1, r_1, w_1)), \ldots, (p_n, (l_n, r_n, w_n)))$, the
weight of $h$ is $w(h) = \prod_{i=1}^{n} w_i$, and $h$ is called
leftmost if

$$L(h) = \forall 1 \le i < n : p_{i+1} \not<_{lex} p_i$$

Here the lexicographic order on paths is defined by
$() <_{lex} (a)$; $(a_1) <_{lex} (a_2)$ iff $a_1 < a_2$; and
$(a_1) \cdot b_1 <_{lex} (a_2) \cdot b_2$ iff
$a_1 < a_2 \vee (a_1 = a_2 \wedge b_1 <_{lex} b_2)$.

The reflexive, transitive closure of $\Rightarrow_G$ is written
$\Rightarrow_G^*$ (derives in $G$), and the restriction of
$\Rightarrow_G^*$ to leftmost derivation histories is
$\Rightarrow_G^{L*}$ (leftmost derives in $G$).

The weight of $a$ becoming $b$ in $G$ is
$w_G(a, b) = \sum_{h : (a, ()) \Rightarrow_G^{L*} (b, h)} w(h)$, the
sum of weights of all unique (leftmost) derivations transforming $a$
to $b$, and the weight of $t$ in $G$ is $W_G(t) = w_G(S, t)$. The
weighted regular tree language produced by $G$ is
$L_G = \{(t, w) \in T_\Sigma \times \mathbb{R}^+ \mid W_G(t) = w\}$.

For  every  weighted  context-free  grammar,  there  is  an 
equivalent  wRTG  that  produces  its  weighted  derivation 
trees  with  yields  being  the  string  produced,  and  the  yields 
of  regular  tree  grammars  are  context  free  string  languages 
(Gecseg  and  Steinby,  1984). 

What is sometimes called a forest in natural language generation
(Langkilde, 2000; Nederhof and Satta, 2002) is a finite wRTG without
loops, i.e.,
$\forall n \in N : (n, ()) \Rightarrow_G^* (t, h) \Rightarrow paths_t(\{n\}) = \emptyset$.
Regular tree languages are strictly contained in tree sets of tree
adjoining grammars (Joshi and Schabes, 1997).


4  Extended-LHS Tree Transducers (xR)

Section 1 informally described the root-to-frontier transducer class
R. We saw that R allows, by use of states, finite lookahead and
arbitrary rearrangement of non-sibling input subtrees removed by a
finite distance. However, it is often easier to write rules that
explicitly represent such lookahead and movement, relieving the burden
on the user to produce the requisite intermediary rules and states. We
define xR, a convenience-oriented generalization of weighted R.
Because of its good fit to natural language problems, xR is already
briefly touched on, though not defined, in (Rounds, 1970).

A weighted extended-lhs root-to-frontier tree transducer $X$ is a
quintuple $(\Sigma, \Delta, Q, Q_i, R)$ where $\Sigma$ is the input
alphabet, $\Delta$ is the output alphabet, $Q$ is a finite set of
states, $Q_i \in Q$ is the initial (or start, or root) state, and
$R \subseteq Q \times XRPAT_\Sigma \times T_\Delta(Q \times paths) \times \mathbb{R}^+$
is a finite set of weighted transformation rules, written
$(q, pattern) \rightarrow^{w} rhs$, meaning that an input subtree
matching $pattern$ while in state $q$ is transformed into $rhs$, with
$Q \times paths$ leaves replaced by their (recursive) transformations.
The $Q \times paths$ leaves of a rhs are called nonterminals (there
may also be terminal leaves labeled by the output tree alphabet
$\Delta$).

$XRPAT_\Sigma$ is the set of finite tree patterns: predicate functions
$f : T_\Sigma \rightarrow \{0, 1\}$ that depend only on the label and
rank of a finite number of fixed paths of their input. xR is the set
of all such transducers. R, the set of conventional top-down
transducers, is a subset of xR where the rules are restricted to use
finite tree patterns that depend only on the root:
$RPAT_\Sigma = \{p_{\sigma,r}(t)\}$ where
$p_{\sigma,r}(t) = (label_t(()) = \sigma \wedge rank_t(()) = r)$.

Rules whose rhs are a pure $T_\Delta$ with no states/paths for further
expansion are called terminal rules. Rules of the form
$(q, pat) \rightarrow^{w} (q', ())$ are $\epsilon$-rules, or epsilon
rules, which substitute state $q'$ for state $q$ without producing
output, and stay at the current input subtree. Multiple initial states
are not needed: we can use a single start state $Q_i$, and instead of
each initial state $q$ with starting weight $w$ add the rule
$(Q_i, TRUE) \rightarrow^{w} (q, ())$ (where $TRUE(t) = 1, \forall t$).

We define the binary relation $\Rightarrow_X$ for xR transducer $X$ on
$T_{\Sigma \cup \Delta \cup Q} \times (paths \times R)^*$, pairs of
partially transformed (working) trees and derivation histories:

$$\Rightarrow_X = \{((a, h), (b, h \cdot (i, (q, pat, r, w)))) \mid (q, pat, r, w) \in R \wedge i \in paths_a \wedge q = label_a(i) \wedge pat(a \downarrow (i \cdot (1))) = 1 \wedge b = a[i \leftarrow r[p \leftarrow q'(a \downarrow (i \cdot (1) \cdot i')), \forall p : label_r(p) = (q', i')]]\}$$

That is, $b$ is derived from $a$ by application of a rule
$(q, pat) \rightarrow^{w} r$ to an unprocessed input subtree
$a \downarrow i$ which is in state $q$, replacing it by output given
by $r$, with its nonterminals replaced by the instruction to transform
descendant input subtrees at relative path $i'$ in state $q'$. The
sources of a rule $r = (q, l, rhs, w) \in R$ are the input-path parts
of the rhs nonterminals:

$$sources(rhs) = \{i' \mid \exists p \in paths_{rhs}(Q \times paths),\ q' \in Q : label_{rhs}(p) = (q', i')\}$$

If the sources of a rule refer to input paths that do not exist in the
input, then the rule cannot apply (because
$a \downarrow (i \cdot (1) \cdot i')$ would not exist). In the
traditional statement of R, $sources(rhs)$ is always
$\{(1), \ldots, (n)\}$, writing $x_i$ instead of $(i)$, but in xR, we
identify mapped input subtrees by arbitrary (finite) paths.

An input tree is transformed by starting at the root in the initial
state, and recursively applying output-generating rules to a frontier
of (copies of) input subtrees (each marked with their own state),
until (in a complete derivation, finishing at the leaves with terminal
rules) no states remain.

Let $\Rightarrow_X^*$, $\Rightarrow_X^{L*}$, and $w_X(a, b)$ follow
from $\Rightarrow_X$ exactly as in Section 3. Then the weight of
$(i, o)$ in $X$ is $W_X(i, o) = w_X(Q_i(i), o)$. The weighted tree
transduction given by $X$ is
$X_X = \{(i, o, w) \in T_\Sigma \times T_\Delta \times \mathbb{R}^+ \mid W_X(i, o) = w\}$.

5  Parsing  a  Tree  Transduction 

Derivation trees for a transducer $X = (\Sigma, \Delta, Q, Q_i, R)$
are trees labeled by rules ($R$) that dictate the choice of rules in a
complete $X$-derivation. Figure 3 shows derivation trees for a
particular transducer. In order to generate derivation trees for $X$
automatically, we build a modified transducer $X'$. This new
transducer produces derivation trees on its output instead of normal
output trees. $X'$ is $(\Sigma, R, Q, Q_i, R')$, with

$$R' = \{(q, pattern, rule(yield_{rhs}(Q \times paths)), w) \mid rule = (q, pattern, rhs, w) \in R\}$$

That is, the original rhs of rules are flattened into a tree of depth
1, with the root labeled by the original rule, and all the
non-expanding $\Delta$-labeled nodes of the rhs removed, so that the
remaining children are the nonterminal yield in left-to-right order.
Derivation trees deterministically produce a single weighted output
tree.

The derived transducer $X'$ nicely produces derivation trees for a
given input, but in explaining an observed (input/output) pair, we
must restrict the possibilities further. Because the transformations
of an input subtree depend only on that subtree and its state, we can
(Algorithm 1) build a compact wRTG that produces exactly the weighted
derivation trees corresponding to $X$-transductions
$(I, ()) \Rightarrow_X^* (O, h)$ (with weight equal to $w_X(h)$).
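The construction of the derivation grammar (Algorithm 1, DERIV) can be sketched in Python as a memoized recursion over (state, input path, output path) triples. This is a simplified illustration under stated assumptions, not the authors' code: trees are `(label, [children])` tuples, a pattern is any Python predicate on an input subtree, an rhs nonterminal is the hypothetical triple `("NT", state, input_path)`, and the example rule set reuses the English-to-Arabic productions of Section 1 with two assumed terminal rules for `qv` and `qnp`.

```python
def subtree(tree, path):
    """Subtree at a path of 1-based child indices, or None if it does not exist."""
    for i in path:
        _, kids = tree
        if not 1 <= i <= len(kids):
            return None
        tree = kids[i - 1]
    return tree


def match_rhs(rhs, out, opath, found):
    """Check rhs against output subtree `out`; record (state, in path, out path)."""
    if len(rhs) == 3:                      # nonterminal leaf ("NT", state, path)
        found.append((rhs[1], rhs[2], opath))
        return True
    label, kids = rhs
    if label != out[0] or len(kids) != len(out[1]):
        return False
    return all(match_rhs(k, o, opath + (j,), found)
               for j, (k, o) in enumerate(zip(kids, out[1]), start=1))


def deriv(rules, start_state, I, O):
    """Return (start nonterminal, productions) or None if O is not derivable from I."""
    productions, memo = [], {}

    def produce(q, i, o):
        key = (q, i, o)
        if key in memo:
            return memo[key]
        memo[key] = False                  # guard against epsilon-rule cycles
        in_sub, out_sub = subtree(I, i), subtree(O, o)
        any_rule = False
        for index, (rq, pattern, rhs, w) in enumerate(rules):
            found = []
            if rq != q or not pattern(in_sub) or not match_rhs(rhs, out_sub, (), found):
                continue
            children = [(q2, i + i2, o + o2) for q2, i2, o2 in found]
            if all(subtree(I, ci) is not None and produce(cq, ci, co)
                   for cq, ci, co in children):
                productions.append((key, index, children, w))
                any_rule = True
        memo[key] = any_rule
        return any_rule

    start = (start_state, (), ())
    return (start, productions) if produce(*start) else None


def root(label, rank):
    """Pattern that inspects only the root label and rank (an R-style pattern)."""
    return lambda t: t[0] == label and len(t[1]) == rank


RULES = [
    ("q", root("S", 2), ("S", [("NT", "qleft", (2,)), ("NT", "qpro", (1,)),
                               ("NT", "qright", (2,))]), 1.0),
    ("qpro", root("PRO", 0), ("PRO", []), 1.0),
    ("qleft", root("VP", 2), ("NT", "qv", (1,)), 1.0),
    ("qright", root("VP", 2), ("NT", "qnp", (2,)), 1.0),
    ("qv", root("V", 0), ("V", []), 1.0),      # assumed terminal rule
    ("qnp", root("NP", 0), ("NP", []), 1.0),   # assumed terminal rule
]
I = ("S", [("PRO", []), ("VP", [("V", []), ("NP", [])])])
O = ("S", [("V", []), ("PRO", []), ("NP", [])])
grammar = deriv(RULES, "q", I, O)
```

Each production is stored as (lhs nonterminal, rule index, child nonterminals, weight), which is the depth-1 derivation-tree form described above. As in Algorithm 1, productions of nonterminals that succeed locally but are unreachable from the start may remain in the list and would be pruned later.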

Figure 3: Derivation trees for an R tree transducer.

Algorithm 1: DERIV

Input: xR transducer $X = (\Sigma, \Delta, Q, Q_i, R)$ and observed
    tree pair $I \in T_\Sigma$, $O \in T_\Delta$.
Output: derivation wRTG
    $G = (R, N \subseteq Q \times paths_I \times paths_O, S, P)$
    generating all weighted derivation trees for $X$ that produce $O$
    from $I$. Returns false instead if there are no such trees.

begin
    $S \leftarrow (Q_i, (), ())$, $N \leftarrow \emptyset$, $P \leftarrow \emptyset$
    if $PRODUCE_{I,O}(S)$ then
        return $(R, N, S, P)$
    else
        return false
end

memoized $PRODUCE_{I,O}(q, i, o)$ returns boolean =
begin
    anyrule? $\leftarrow$ false
    for $r = (q, pattern, rhs, w) \in R$ : $pattern(I \downarrow i) = 1 \wedge MATCH_{O,\Delta}(rhs, o)$ do
        $(o_1, \ldots, o_n) \leftarrow paths_{rhs}(Q \times paths)$ sorted by $o_1 <_{lex} \ldots <_{lex} o_n$
            // $n = 0$ if there are none
        $labelandrank_{derivrhs}(()) \leftarrow (r, n)$
        for $j \leftarrow 1$ to $n$ do
            $(q', i') \leftarrow label_{rhs}(o_j)$
            $c \leftarrow (q', i \cdot i', o \cdot o_j)$
            if $\neg PRODUCE_{I,O}(c)$ then next $r$
            $labelandrank_{derivrhs}((j)) \leftarrow (c, 0)$
        anyrule? $\leftarrow$ true
        $P \leftarrow P \cup \{((q, i, o), derivrhs, w)\}$
    if anyrule? then $N \leftarrow N \cup \{(q, i, o)\}$
    return anyrule?
end

$MATCH_{t,\Sigma}(t', p) = \forall p' \in paths_{t'} : label_{t'}(p') \in \Sigma \Rightarrow labelandrank_{t'}(p') = labelandrank_t(p \cdot p')$

The possible derivations for a given $PRODUCE_{I,O}(q, i, o)$ are
constant and need not be computed more than once, so the function is
memoized. We have in the worst case to visit all
$|Q| \cdot |I| \cdot |O|$ triples $(q, i, o)$ and have all $|R|$
transducer rules match at each of them. If enumerating rules matching
transducer input-patterns and output-subtrees has cost $L$ (constant
given a transducer), then DERIV has time complexity
$O(L \cdot |Q| \cdot |I| \cdot |O| \cdot |R|)$.

6  Inside-Outside for wRTG

Given a wRTG $G = (\Sigma, N, S, P)$, we can compute the sums of
weights of trees derived using each production by adapting the
well-known inside-outside algorithm for weighted context-free (string)
grammars (Lari and Young, 1990).

The inside weights using $G$ are given by
$\beta_G : T_\Sigma \rightarrow (\mathbb{R} - \mathbb{R}^-)$, giving
the sum of weights of all tree-producing derivations from trees with
nonterminal leaves:

$$\beta_G(t) = \begin{cases} \sum_{(t, r, w) \in P} w \cdot \beta_G(r) & \text{if } t \in N \\ \prod_{p \in paths_t(N)} \beta_G(label_t(p)) & \text{otherwise} \end{cases}$$

By definition, $\beta_G(S)$ gives the sum of the weights of all trees
generated by $G$. For the wRTG generated by $DERIV(X, I, O)$, this is
exactly $W_X(I, O)$.

Outside weights $\alpha_G$ for a nonterminal are the sums of weights
of trees generated by the wRTG that have derivations containing it,
but excluding its inside weights (that is, the weights summed do not
include the weights of rules used to expand an instance of it).

$$\alpha_G(n \in N) = \begin{cases} 1 & \text{if } n = S \\ \sum_{p, (n', r, w) \in P : label_r(p) = n} w \cdot \alpha_G(n') \cdot \prod_{p' \in paths_r(N) - \{p\}} \beta_G(label_r(p')) & \text{otherwise} \end{cases}$$

In the second case, the sum ranges over the uses of $n$ in
productions, and the product ranges over the sibling nonterminals of
each such use.

Finally, given inside and outside weights, the sum of weights of trees
using a particular production is
$\gamma_G((n, r, w) \in P) = \alpha_G(n) \cdot w \cdot \beta_G(r)$.

Computing $\alpha_G$ and $\beta_G$ for nonrecursive wRTG is a
straightforward translation of the above recursive definitions (using
memoization to compute each result only once) and is $O(|G|)$ in time
and space.
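The recursive definitions of the inside, outside, and per-production weights translate directly into memoized Python for a nonrecursive wRTG. The following is a sketch under assumptions: productions are `(lhs, rhs, weight)` triples, an rhs is a `(label, [children])` tuple whose nonterminal leaves are plain strings, and the toy grammar at the bottom is invented for illustration (it is not from the paper). `math.prod` requires Python 3.8 or later.

```python
from functools import lru_cache
from math import prod


def nonterminal_leaves(rhs):
    """Nonterminal leaves of an rhs, left to right (nonterminals are strings)."""
    if isinstance(rhs, str):
        return [rhs]
    _, kids = rhs
    return [n for kid in kids for n in nonterminal_leaves(kid)]


def inside_outside(productions, start):
    """Return (beta, alpha, gamma) for a nonrecursive wRTG.

    beta and alpha are functions on nonterminals; gamma is a list aligned
    with `productions`, giving alpha(lhs) * w * beta(rhs) for each one.
    """
    @lru_cache(maxsize=None)
    def beta(n):
        # Inside weight: sum over productions expanding n.
        return sum(w * prod(beta(m) for m in nonterminal_leaves(rhs))
                   for lhs, rhs, w in productions if lhs == n)

    @lru_cache(maxsize=None)
    def alpha(n):
        # Outside weight: sum over every use of n in some production's rhs.
        if n == start:
            return 1.0
        total = 0.0
        for lhs, rhs, w in productions:
            leaves = nonterminal_leaves(rhs)
            for k, m in enumerate(leaves):
                if m == n:
                    siblings = leaves[:k] + leaves[k + 1:]
                    total += w * alpha(lhs) * prod(beta(s) for s in siblings)
        return total

    gamma = [alpha(lhs) * w * prod(beta(m) for m in nonterminal_leaves(rhs))
             for lhs, rhs, w in productions]
    return beta, alpha, gamma


# Hypothetical toy grammar: s -> A(n, m) | B,  n -> C | D,  m -> E
TOY = [
    ("s", ("A", ["n", "m"]), 2.0),
    ("s", ("B", []), 0.5),
    ("n", ("C", []), 0.25),
    ("n", ("D", []), 0.5),
    ("m", ("E", []), 1.0),
]
beta, alpha, gamma = inside_outside(TOY, "s")
```

For a recursive grammar these definitions would not terminate; derivation grammars built from a finite tree pair without epsilon cycles do not have that problem.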

7  EM  Training 

Expectation-Maximization training (Dempster, Laird, and Rubin, 1977)
works on the principle that the corpus likelihood can be maximized
subject to some normalization constraint on the parameters by
repeatedly (1) estimating the expectation of decisions taken for all
possible ways of generating the training corpus given the current
parameters, accumulating parameter counts, and (2) maximizing by
assigning the counts to the parameters and renormalizing. Each
iteration is guaranteed to increase the likelihood until a local
maximum is reached.

Algorithm 2 implements EM xR training, repeatedly computing
inside-outside weights (using fixed transducer derivation wRTGs for
each input/output tree pair) to efficiently sum each parameter
contribution to likelihood over all derivations. Each EM iteration
takes time linear in the size of the transducer and linear in the size
of the derivation tree grammars for the training examples. The size of
the derivation trees is at worst
$O(|Q| \cdot |I| \cdot |O| \cdot |R|)$. For a corpus of $K$ examples
with average input/output size $M$, an iteration takes (at worst)
$O(|Q| \cdot |R| \cdot K \cdot M^2)$ time, which is quadratic, like
the forward-backward algorithm.
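The maximization step amounts to dividing each rule's expected count by a normalization total. The sketch below shows one possible choice, assumed here for illustration: rules are grouped by their state, so the weights of all rules sharing a state sum to one. The rule encoding as `(state, pattern, rhs)` keys, the function name, and the counts are hypothetical.

```python
from collections import defaultdict


def normalize_by_state(counts):
    """Turn expected rule counts into weights that sum to 1 within each state.

    `counts` maps a rule key (state, pattern, rhs) to its accumulated count.
    Rules whose state has a zero total are left out of the result.
    """
    totals = defaultdict(float)
    for (state, _pattern, _rhs), count in counts.items():
        totals[state] += count
    return {rule: count / totals[rule[0]]
            for rule, count in counts.items()
            if totals[rule[0]] > 0}


# Hypothetical expected counts gathered in the estimation step.
counts = {
    ("t", "car", "kuruma"): 3.0,
    ("t", "car", "wa"): 1.0,
    ("i", "x", "de"): 2.0,
}
weights = normalize_by_state(counts)
```

A conditional model would normally group by state and input pattern together; swapping the grouping key is the only change needed.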

Algorithm 2: TRAIN

Input: xR transducer $X = (\Sigma, \Delta, Q, Q_i, R)$, observed
    weighted tree pairs
    $T \in T_\Sigma \times T_\Delta \times \mathbb{R}^+$,
    normalization function $Z(\{count_r \mid r \in R\}, r' \in R)$,
    minimum relative log-likelihood change for convergence
    $\epsilon \in \mathbb{R}^+$, maximum number of iterations
    $maxit \in \mathbb{N}$, and prior counts (for a so-called
    Dirichlet prior) $\{prior_r \mid r \in R\}$ for smoothing each
    rule.
Output: New rule weights $W = \{w_r \mid r \in R\}$.

begin
    for $(i, o, w) \in T$ do
        $d_{i,o} \leftarrow DERIV(X, i, o)$    // Alg. 1
        if $d_{i,o}$ = false then
            $T \leftarrow T - \{(i, o, w)\}$
            warn(more rules are needed to explain $(i, o)$)
        compute inside/outside weights for $d_{i,o}$ and remove all
            useless nonterminals $n$ whose $\beta_{d_{i,o}}(n) = 0$ or
            $\alpha_{d_{i,o}}(n) = 0$
    $itno \leftarrow 0$, $lastL \leftarrow -\infty$, $\delta \leftarrow \epsilon$
    for $r = (q, pat, rhs, w) \in R$ do $w_r \leftarrow w$
    while $\delta \ge \epsilon \wedge itno < maxit$ do
        for $r \in R$ do $count_r \leftarrow prior_r$
        $L \leftarrow 0$
        for $(i, o, w_{example}) \in T$ do    // Estimate
            let $D = d_{i,o} = (R, N, S, P)$
            compute $\alpha_D$, $\beta_D$ using latest
                $W = \{w_r \mid r \in R\}$    // see Section 6
            for $prod = (n, rhs, w) \in P$ do
                $\gamma_D(prod) \leftarrow \alpha_D(n) \cdot w \cdot \beta_D(rhs)$
                let $rule = label_{rhs}(())$
                $count_{rule} \leftarrow count_{rule} + w_{example} \cdot \gamma_D(prod) / \beta_D(S)$
            $L \leftarrow L + \log \beta_D(S) \cdot w_{example}$
        for $r = (q, pattern, rhs, w) \in R$ do    // Maximize
            $w_r \leftarrow count_r / Z(\{count_r \mid r \in R\}, r)$
            // e.g., $Z((q, a, b, c)) = \sum_{r = (q, d, e, f) \in R} count_r$
        $\delta \leftarrow (L - lastL) / |L|$
        $lastL \leftarrow L$, $itno \leftarrow itno + 1$
end

8  Tree-to-String Transducers (xRS)

We now turn to tree-to-string transducers (xRS). In the automata
literature, these were first called generalized syntax-directed
translations (Aho and Ullman, 1971) and used to specify compilers.
Tree-to-string transducers have also been applied to machine
translation (Yamada and Knight, 2001; Eisner, 2003).

We give an explicit tree-to-string transducer example in the next
section. Formally, a weighted extended-lhs root-to-frontier
tree-to-string transducer $X$ is a quintuple
$(\Sigma, \Delta, Q, Q_i, R)$ where $\Sigma$ is the input alphabet,
$\Delta$ is the output alphabet, $Q$ is a finite set of states,
$Q_i \in Q$ is the initial (or start, or root) state, and
$R \subseteq Q \times XRPAT_\Sigma \times (\Delta \cup (Q \times paths))^* \times \mathbb{R}^+$
is a finite set of weighted transformation rules, written
$(q, pattern) \rightarrow^{w} rhs$. A rule says that to transform
(with weight $w$) an input subtree matching $pattern$ while in state
$q$, replace it by the string of $rhs$ with its nonterminal
($Q \times paths$) letters replaced by their (recursive)
transformation.

xRS is the same as xR, except that the rhs are strings containing some
nonterminals instead of trees containing nonterminal leaves (so the
intermediate derivation objects are strings containing state-marked
input subtrees). We have developed an xRS training procedure similar
to the xR procedure, with extra computational expense to consider how
different productions might map to different spans of the output
string. Space limitations prohibit a detailed description; we refer
the reader to a longer version of this paper (submitted). We note that
this algorithm subsumes normal inside-outside training of PCFG on
strings (Lari and Young, 1990), since we can always fix the input tree
to some constant for all training examples.

9  Example 

It is possible to cast many current probabilistic natural language
models as R-type tree transducers. In this section, we implement the
translation model of (Yamada and Knight, 2001). Their generative model
provides a formula for P(Japanese string | English tree), in terms of
individual parameters, and their appendix gives special EM
re-estimation formulae for maximizing the product of these conditional
probabilities across the whole tree/string corpus.

We now build a trainable xRS tree-to-string transducer that embodies
the same P(Japanese string | English tree). First, we need start
productions like these, where q is the start state:

- q x:S -> q.TOP.S x

- q x:VP -> q.TOP.VP x

These set up states like q.TOP.S, which means “translate this tree,
whose root is S.” Then every q.parent.child pair gets its own set of
three insert-function-word productions, e.g.:

- q.TOP.S x -> i x, r x

- q.TOP.S x -> r x, i x

- q.TOP.S x -> r x

- q.NP.NN x -> i x, r x

- q.NP.NN x -> r x, i x

- q.NP.NN x -> r x

State  i  means  “produce  a  Japanese  function  word  out  of 
thin  air.”  We  include  an  i  production  for  every  Japanese 
word  in  the  vocabulary,  e.g.: 

- i x -> de

- i x -> kuruma

- i x -> wa

State r means “re-order my children and then recurse.” For internal
nodes, we include a production for every parent/child-sequence and
every permutation thereof, e.g.:

- r NP(x0:CD, x1:NN) -> q.NP.CD x0, q.NP.NN x1

- r NP(x0:CD, x1:NN) -> q.NP.NN x1, q.NP.CD x0

The rhs sends the child subtrees back to state q for recursive
processing. However, for English leaf nodes, we instead transition to
a different state t, so as to prohibit any subsequent Japanese
function word insertion:

- r NN(x0:car) -> t x0

- r CC(x0:and) -> t x0

State t means “translate this word,” and we have a production for
every pair of co-occurring English and Japanese words:

- t car -> kuruma

- t car -> wa

- t car -> *e*

This  follows  (Yamada  and  Knight,  2001)  in  also  allowing 
English  words  to  disappear,  or  translate  to  epsilon. 

Every production in the xRS transducer has an associated weight and
corresponds to exactly one of the model parameters.

There are several benefits to this xRS formulation. First, it
clarifies the model, in the same way that (Knight and Al-Onaizan,
1998; Kumar and Byrne, 2003) elucidate other machine translation
models in easily-grasped FST terms. Second, the model can be trained
with generic, off-the-shelf tools, versus the alternative of working
out model-specific re-estimation formulae and implementing custom
training software. Third, we can easily extend the model in
interesting ways. For example, we can add productions for multi-level
and lexical re-ordering:

- r NP(x0:NP, PP(IN(of), x1:NP)) -> q x1, no, q x0

We can add productions for phrasal translations:

- r NP(JJ(big), NN(cars)) -> ooki, kuruma

This can now include crucial non-constituent phrasal translations:

- r S(NP(PRO(there)), VP(VB(are), x0:NP)) -> q x0, ga, arimasu

We can also eliminate many epsilon word-translation rules in favor of
more syntactically-controlled ones, e.g.:

- r NP(DT(the), x0:NN) -> q x0

We  can  make  many  such  changes  without  modifying  the 
training  procedure,  as  long  as  we  stick  to  tree  automata. 

10  Related  Work 

Tree substitution grammars or TSG (Schabes, 1990) are equivalent to
regular tree grammars. xR transducers are similar to (weighted)
Synchronous TSG, except that xR can copy input trees (and transform
the copies differently), but does not model deleted input subtrees.
(Eisner, 2003) discusses training for Synchronous TSG. Our training
algorithm is a generalization of forward-backward EM training for
finite-state (string) transducers, which is in turn a generalization
of the original forward-backward algorithm for Hidden Markov Models.